Conversation
Wow, that is wild.
tools/server/server-context.cpp
```cpp
if (!ngram_mod && params_base.speculative.type == COMMON_SPECULATIVE_TYPE_NGRAM_MAP_MOD) {
    ngram_mod = std::make_unique<common_ngram_mod>(params_base.speculative.ngram_size_n, 1024*1024);
    ...
    params_base.speculative.ngram_mod = ngram_mod.get();
```
I guess you have to do it this way because `unique_ptr` doesn't accept a forward-declared struct?
If that's the case, using `std::shared_ptr` or `std::optional` could be a better hack.
common/ngram-mod.h
```cpp
void add(const int32_t * tokens);
int32_t get(const int32_t * tokens, int32_t offs) const; // return -1 if not found
...
uint16_t n; // ngram size to hash
```
In multiple places in the code we need to cast this to `size_t`, so I think it's probably better to use `size_t`:

```diff
- uint16_t n; // ngram size to hash
+ size_t n; // ngram size to hash
```
Can the EOG/EOS token be a good criterion? The ngrams can be different between user messages and assistant messages.
EOG/EOS seems way too often. The hash container can store a lot of ngram hashes (hundreds of thousands with the current size) before collisions start to occur. I'm thinking more about logic such as: if more than …
Force-pushed from 7ef5b95 to a9a076f
common/ngram-mod.h
```cpp
std::vector<common_ngram_mod_ext_entry> entries;
};
...
using common_ngram_mod_ext_ptr = std::unique_ptr<common_ngram_mod_ext>;
```
common/ngram-mod.h
```cpp
std::vector<entry_t> entries;
};
...
using common_ngram_mod_ptr = std::unique_ptr<common_ngram_mod>;
```
Looks like llama-bench doesn't know about the `--spec-type ngram-mod` param.
It seems this PR has an additional positive side effect: in the case of GPT-OSS in high mode, when the model falls into a reasoning loop, it can now recover much faster. Token generation jumps to around 200, and the model even produces a meaningful result.
@MikeLP This does not affect …

@characharm Yes, I also noticed that. Overall, I think this speculator can become enabled by default in …
Is this example for a MoE or a dense model? I have no intuitive feel for what constitutes "small n" or a "long draft". I assume the optimal value depends on the model, the model architecture, and the task at hand.
On the same prompt (just asking to repeat verbatim 200 lines of given source code) I sometimes see a draft acceptance rate of 0, while on most other runs it's 0.90+ on gpt-oss-120b with … Below are logs of a bad case followed by a good case. (I also observed a good case right after starting llama-server, so it's not like the first request is always "bad".)
I'm experimenting with the n/min/max settings, but I don't understand the balance yet. Does a large min–max range hurt us somehow? Qwen 30B and settings from the post. I see accept: …
Larger ngram sizes and larger drafts increase the chances that we will draft only when the LLM is repeating existing text. Basically, we are trying to detect long repeating blocks without doing exhaustive searches. So unless your use case involves such repeating blocks of text, this method won't help. Yes, the …
Do I understand correctly that … means that the total "cost" of ngram_mod was only about 5 ms? My point is: should I try to increase that time by changing `--spec-ngram-size-n` / `--draft-min` / `--draft-max`, since even 500 ms still wouldn't be noticeable?
So far, I think … Regarding …
Re #19164 (comment): I think I just observed the effects of an early low-acceptance streak (3), where the triggered reset clears the actually still very useful ngrams from prompt processing. Naively, I would rather not clear the complete hash pool, but keep the ngrams from the prompt processing of the current request even if there are streaks.
Just an idea for more consistent and sustained speedup behavior, avoiding the downsides of early low-acceptance streaks in the current pruning mechanism: track a capped score for each ngram in the pool, initially set to 1 on insert. If an ngram was used successfully in a draft, count it up. If the draft was rejected, count it down. On streaks, remove all ngrams with a score of 0 or less. Not sure if it's important to keep occupancy below a certain threshold.
It works pretty well in OpenCode (GLM 4.7 Flash with thinking enabled), but I'm not sure if it's real or placebo. I assume that a draft acceptance rate above 0.1 indicates some speedup. (I also see >0.5.)
To add to my message #19231: I think there is still a problem. And after regenerating: …
This is one of the ideas leading to the vector …
I see the same thing.
* spec : add ngram-mod
* cont : simplify + keep track of occupancy
* cont : cleanup
* cont : move initialization to common/speculative
* cont : cleanup
* cont : cleanup
* cont : fix
Not a bug, just something I noticed. When the prompt contains an uploaded/pasted file with CRLF line endings, the ngrams often don't get accepted (even if the task is repeating the file verbatim) because models prefer LF endings.

Running with:

```
llama-server -m Devstral-2-123B-Instruct-2512-UD-Q5_K_XL-00001-of-00002.gguf --no-mmap --temp 0.15 --port 55553 --metrics --min-p 0.01 -c 32768 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 32 --draft-max 48
```

build: 7992 (612db61) with GNU 13.3.0 for Linux aarch64

Stats for CRLF file + prompt "Repeat verbatim." (temp set to 0 in UI)

Stats for LF file + prompt "Repeat verbatim." (temp set to 0 in UI)


cont #18471
Add basic ngram hasher for speculative decoding:

- hash `n` tokens and pick the next token from the storage

Some characteristics:

- … (`m` is not fixed)

Currently, a single hash pool is shared across all server slots, so different requests can benefit from each other.
Sample usage:
Applications:
Example:
spec-mod-0.mov
TODO: